Foundations of a New Test Theory


Abstract 

It is only a slight exaggeration to describe the test theory
that dominates educational measurement today as the application of
twentieth century statistics to nineteenth century psychology.
Sophisticated estimation procedures, new techniques for
missing-data problems, and theoretical advances into latent-variable
modeling have appeared--all applied with psychological models that
explain problem-solving ability in terms of a single, continuous
variable. This caricature suffices for many practical prediction
and selection problems because it expresses patterns in data that
are pertinent to the decisions that must be made. It falls short
for placement and instruction problems based on students' internal
representations of systems, problem-solving strategies, or
reconfigurations of knowledge as they learn. Such applications
demand different caricatures of ability--more realistic ones that
can express patterns suggested by recent developments in cognitive
and educational psychology. The application of modern statistical
methods with modern psychological models constitutes the
foundation of a new test theory.

Key  Words:  Cognitive  psychology 
Introduction


Educational measurement faces a crisis today that would
appear to threaten its very foundations. The essential problem is
that the view of human abilities implicit in standard test
theory--item response theory as well as classical true-score
theory--is incompatible with the view rapidly emerging from
cognitive and educational psychology. Learners increase their
competence not by simply accumulating new facts and skills, but by
reconfiguring their knowledge structures, by automating procedures
and chunking information to reduce memory loads, and by developing
strategies and models that tell them when and how facts and skills
are relevant. The types of observations and the patterns in data
that reflect the ways that students think, perform, and learn
cannot be accommodated by traditional models and methods. It
would seem to some that psychometrics has little to offer in
the quest to apply this new knowledge to the practical educational
problems of the individual, the classroom, or the nation (Hunt and
MacLeod, 1978).

I  concur  that  the  standard  methods  of  test  theory  do  not 
suffice  for  solving  problems  cast  in  the  framework  of  what  we  are 
learning  about  how  people  acquire  knowledge  and  competence,  but  I 
cannot  agree  that  psychometrics  has  nothing  to  offer. 

Standard test theory evolved as the application of
statistical theory with a simple model of ability that suits the
decision-making environment of most mass educational systems.
Broader educational options, based on insights into the nature of
learning and supported by more powerful technologies, demand a
broader range of models of capabilities--still simple compared to
the realities of cognition, but capturing patterns that inform a
broader range of alternatives. A new test theory can be brought
about by applying to well-chosen cognitive models the same general
principles of statistical inference that led to standard test
theory when applied to the simple model.

The first half of this paper sketches the evolution of
standard test theory, highlighting the challenges that spurred
each new advance. The challenges that cognitive and educational
psychology present today are then discussed, and a framework for
responding to them is outlined. Directions for needed
development are exemplified with current work.

The  Early  Context  of  Educational  Decisions 

The  kinds  of  decisions  that  shaped  the  evolution  of  classical 
test theory were nearly universal in education at the beginning of
this  century,  and  dominate  practice  yet  today.  They  were  born  of 
the  constraints  educators  encountered  as  they  launched  their 
campaign  to  provide  education  on  a  broader  scale  than  had  ever 
been  attempted  hitherto: 

"...the  demand  for  tests  arose  during  the  period  when 
school  attendance  was  made  compulsory  and  when  higher 
education  was  developing  its  strengths.  Educators  faced 
the  unprecedented  dilemma  of  dealing  with  the  range  and 
diversity of abilities and backgrounds that individuals
bring to schooling. They needed ways of determining
which children and youths would be able to profit from
some form of instruction as given in ordinary school and
college practices as designed essentially for the
majority of the population." (Glaser, 1981, p. 924).

Educators  were  confronted  with  selection  or  placement  decisions 
for large numbers of students. Resources limited the information
they could gather about each student, constrained the number of
options they could offer, and precluded tailoring programs to
individual  students  once  a  decision  was  made. 

A  first  example  is  selecting  applicants  into  a  college  that 
presents  the  same  material  in  the  same  way  to  all  students.  There 
is  only  one  treatment,  and  the  alternatives  are  to  accept  or 
reject.  The  admissions  officer  would  prefer  to  accept  those  who 
are likely to succeed. When resources permit more than one
decision  option,  the  usual  generalization  of  the  accept/reject 
paradigm  is  to  offer  a  sequence  of  alternatives,  each  more 
demanding  than  the  next.  Placing  high  school  freshmen  into 
academic tracks is an example of this latter type. Problems of
selection  into  a  single  program  and  of  placement  into  a  single 
sequence  are  both  decisions  about  "linearly  ordered  options." 

Exposing  a  diverse  group  of  students  to  a  uniform  educational 
treatment typically produces a distribution of outcomes (Bloom,
1976).  An  individual's  degree  of  success  depends  on  how  his  or 
her  unique  skills,  knowledge,  and  interests  match  up  with  the 
equally  multifaceted  requirements  of  the  treatment. 

At  costs  substantially  lower  than  personal  interviews  or 
performance samples, responses to multiple-choice test items
provide information about certain aspects of this matchup. What
is necessary is that each item tap some of the skills required for
success. Even though a single item might require only a few of
the relevant skills and offer little information in its own right,
a tendency to provide correct answers over a large number of items
supports some degree of prediction of success (Green, 1978). If
all candidates are administered the same items, and one wishes to
predict success in linearly-ordered options, their number-correct
scores  can  be  used  (Dawes  and  Corrigan,  1979).  Even  though  the 
several  students  at  a  given  score  level  possess  different 
constellations  of  skills,  abilities,  and  backgrounds,  making  the 
same  decision  for  all  of  them  among  the  available  alternatives  is 
often  about  as  well  as  can  be  done  with  the  available  data. 

Once the test and the linearly-ordered options are specified,
making  decisions  from  test  performances  requires  nothing  more 
complicated  than  adding  up  numbers  of  correct  responses.  Two 
different  tests  constructed  for  the  same  decision,  however, 
invariably  line  up  examinees  differently  as  they  draw  upon 
different  particular  skills  from  the  myriad  of  those  potentially 
informative.  Additional  statistical  machinery  is  required  to 
guide one in constructing tests and evaluating their quality.
Classical  test  theory  was  a  first  response  to  these  needs. 


Classical  Test  Theory 

Charles Spearman (1904a, 1904b, 1907, 1910, 1913) is credited
with the central idea of classical test theory (CTT): a test score
can be viewed as the sum of two components, a "true" score and a
random "error" term. Two similar ("parallel") tests are
considered to reflect the same true score, but disagree about an
examinee's observed scores because of the error components--the
variance of which can, under the assumptions of CTT, be driven to
zero by just making the tests long enough. Ideally, decisions
would be based on true scores; in practice they must be based on
observed scores. "Reliability," the degree to which the
unobservable true scores account for the variance in observed
scores, gauges the accuracy with which a test lines up a group of
examinees--a reasonable criterion for the quality of a test if it
is assumed that the items tap appropriate skills and scores will
be used to decide among linearly ordered options.
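
The true-score decomposition can be made concrete with a small numerical sketch. The Python fragment below is an illustration under stated assumptions, not a reconstruction of Spearman's own procedures: it assumes normally distributed errors that are independent of true scores, and it uses the Spearman-Brown formula, a standard CTT result that the text does not state, to show reliability approaching one as a test is lengthened with parallel parts. All function names are hypothetical.

```python
import random
import statistics


def simulate_observed(true_scores, error_sd, rng):
    """Observed score = true score + independent random error (CTT decomposition)."""
    return [t + rng.gauss(0.0, error_sd) for t in true_scores]


def reliability(true_scores, observed_scores):
    """Share of observed-score variance accounted for by true scores."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed_scores)


def spearman_brown(rel, k):
    """Reliability of a test lengthened k-fold with parallel parts."""
    return k * rel / (1.0 + (k - 1.0) * rel)


# Assumed setup: true-score and error variances are both 1, so the
# reliability of the simulated test should come out close to one half.
rng = random.Random(0)
true_scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]
observed = simulate_observed(true_scores, error_sd=1.0, rng=rng)

rel = reliability(true_scores, observed)
print(round(rel, 2))

# Lengthening the test raises reliability toward (but never past) one.
print([round(spearman_brown(0.5, k), 3) for k in (1, 2, 4, 16)])
```

In practice true scores are unobservable, so reliability must be estimated indirectly, for example from parallel forms; the simulation has access to them only because it generates them.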

Upon  these  notions  was  founded  a  practicable  testing 
methodology.  Reliability  became  a  paramount  measure  of  the 
quality of a test, although of course reliability had to be
complemented with validity measures such as the correlation
between test scores and subsequent performance. Validity studies
had less influence on test construction, however, because they
arrive too late in the process--only after the test has been
administered and examinees have been followed over time. To
obtain high reliability, one uses items that would be answered
correctly by about half the examinees, for example, and avoids
items  that  would  have  low  correlations  with  the  total  test  scores. 

Note that these dicta could guide test construction solely
from counts and patterns of right and wrong responses to candidate
test items--ignoring both the content of the items and the
contemplated  decision  alternatives.  Of  course  good  test 
construction  does  consider  the  knowledge,  skill,  and  strategy 
requirements  of  items.  The  point  is  that  these  considerations  lie 
outside  the  realm  of  the  classical  test  theory.  Test  developers 
use them independently of, sometimes in contradiction to, what
test  theory  tells  them. 

Building upon Spearman's foundation, psychometricians
developed a vast armamentarium of techniques for building and
using tests (Gulliksen, 1950), such as approximating reliability
from the internal consistency of items within a test (Kuder and
Richardson, 1937) and estimating validity without knowing
subsequent performances of rejected examinees (Kelley, 1923).
Over time, a rigorous axiomatic foundation was laid for
statistical inference under the aegis of CTT (Lord, 1959; Novick,
1966; Lord and Novick, 1968). The simple partitioning of observed
scores into true and error components was generalized to multiple
sources of variation from items, persons, and observational
settings, and the full power of analysis of variance was brought
to bear upon decision-making problems using test scores (Cronbach,
Gleser, Nanda, and Rajaratnam, 1972; Lord and Novick, 1968).
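
As an illustration of approximating reliability from internal consistency, the following sketch computes the coefficient commonly known as KR-20 for right/wrong items. It is a minimal example rather than a full treatment: the function name is hypothetical, the data are invented, and the use of the population (divide-by-N) variance of total scores is one of two conventions in use.

```python
def kr20(responses):
    """KR-20 internal-consistency coefficient.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Variance of the number-correct scores across examinees.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p * (1 - p), where p is the proportion correct.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - pq_sum / var_total)


# Invented data: four examinees, three items ordered from easy to hard.
data = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(kr20(data))
```

Note that the computation uses only counts and patterns of right and wrong responses, which is exactly the property of CTT dicta remarked on above.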

A source of dissatisfaction with CTT early on was that its
characterizations of examinees, such as total score and percentile
rank, and of items, such as percent-correct and item-test
correlation, are confounded descriptions of the particular items
that constitute a test and the particular group of examinees who
take it (Wright, 1968). If one test consists of easier items
than a second otherwise similar test, examinees' scores on the two
tests are not directly comparable and score distributions have
different shapes. If a test is administered to groups of
examinees that differ in proficiency, item percents-correct and
item-test correlations differ. When many tests could be
constructed for the same purpose, differing perhaps in difficulty
or length, should there not be a way to characterize examinees
independently of the test they took, and items independently of
the examinees who took them?

In attitude measurement, where agreements to a topic are
analogous to correct answers to test questions, L.L. Thurstone
(1928) expressed the following desideratum: "If a scale is to be
regarded as valid, the scale values of the statements should not
be affected by the opinions of the people whose responses help
to construct it." Thurstone (1925) and E.L. Thorndike (Thorndike
et al., 1926) pioneered efforts to relate test scores to
psychological traits, using item percents-correct and assumptions
about distributions of traits to transform scores from different
tests onto the same scale.

Thurstone and Thorndike scaling, despite allusions to an
underlying trait, remained essentially theories for scores, albeit
transformed (with the aid of untestable assumptions) to permit
comparisons across nonparallel tests. Psychological traits per se
appear as explicit parameters in the models of Ferguson (1942),
Lawley (1943), and Tucker (1946). These researchers studied test
construction problems within CTT by making an assumption beyond
those of CTT proper; namely, that aside from random factors, item
responses were driven by an unobservable ability variable. A
second generation of test theory began to take form as attention
shifted from test scores as the object of inference, to
unobservable variables hypothesized to have produced them.

Item  Response  Theory 

Item response theory (IRT), or "latent trait theory," as it
was called then, appears as a test theory in its own right in the
work of Frederic Lord and Georg Rasch (1960). Like
classical test theory, IRT concerns examinees' overall proficiency
in a domain of tasks. But while CTT makes no statement about the
mechanisms that give rise to performance, IRT posits a single,
unobservable, proficiency variable.

Footnote: If classical test theory offers a statistical model for
test scores without a psychological model, Guttman's (1944)
scaling techniques offer a psychological model without a
statistical model. Important in the reconceptualization of the
meaning of test scores, a Guttman scale can be viewed as the
limiting case in IRT in which each item is perfectly informative
about whether an examinee's ability lies above or below a specific
point on an ability continuum.

At the heart of IRT is a mathematical model for the
probability that a given person will respond correctly to a given
item, a function of that person's proficiency parameter and one or
more parameters for the item. The item's parameters express
properties such as difficulty or sensitivity to proficiency. The
item response, rather than the test score, is the fundamental unit
of observation. If an IRT model holds, responses to any subset of
items support inferences on the same scale of measurement.
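
A response function of this kind can be sketched in a few lines. The passage does not commit to a functional form, so the logistic family used below is an assumption made for illustration; the parameter names `a` (sensitivity to proficiency), `b` (difficulty), and `c` (lower asymptote) and the function name are likewise hypothetical labels. Setting `a = 1` and `c = 0` gives a one-parameter form of the kind associated with Rasch.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: examinee proficiency
    b:     item difficulty
    a:     item sensitivity to proficiency (discrimination)
    c:     lower asymptote (chance of success at very low proficiency)
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# The same examinee facing an easy, a matched, and a hard item.
theta = 0.0
for difficulty in (-1.0, 0.0, 1.0):
    print(difficulty, round(irt_probability(theta, difficulty), 3))
```

Because the item response is the unit of observation, the same function applies whichever subset of items an examinee happens to receive.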

This  conceptualization  opens  the  door  to  solving  many 
practical testing problems that were difficult under CTT, such as:

Test  construction  (Birnbaum,  1968;  Theunissen,  1985).  If 
item  parameters  are  available  for  a  collection  of  items,  tests  can 
be  constructed  for  optimal  performance  in  specific  applications, 
such  as  minimizing  classification  errors. 

Adaptive testing (Lord, 1980, Chapter 10; Weiss, 1989). An
adaptive  testing  scheme  selects  the  best  item  to  administer  next 
to  an  examinee,  based  on  the  amount  of  information  that  various 
available  items  would  provide  and  a  provisional  estimate  of  the 
examinee's  proficiency  from  responses  to  items  given  thus  far. 

Educational assessment (Bock, Mislevy, and Woodson, 1982;
Choppin, 1976; Messick, Beaton, and Lord, 1983). Assessments
gauge  proficiencies  at  the  level  of  populations  rather  than 
individuals,  to  evaluate  programs  and  monitor  trends.  IRT  makes 
it  possible  to  establish  a  stable  measurement  scale  while  allowing 
assessment  instruments  to  evolve  over  time. 
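
The adaptive testing scheme listed among these applications can be sketched as follows. This is a minimal illustration, not the procedure of any of the works cited: it assumes a two-parameter logistic model, takes "amount of information" to mean Fisher information evaluated at the provisional proficiency estimate, and uses an invented item pool. The function names are hypothetical.

```python
import math


def probability_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information an item provides about proficiency at theta."""
    p = probability_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta_hat, items, administered):
    """Index of the not-yet-administered item that is most informative at theta_hat.

    items:        list of (a, b) parameter pairs
    administered: set of indices of items already given
    """
    candidates = [j for j in range(len(items)) if j not in administered]
    return max(candidates, key=lambda j: item_information(theta_hat, *items[j]))


# Invented pool of (a, b) pairs and a provisional estimate of 0.0.
items = [(1.0, -2.0), (1.0, 0.1), (1.0, 2.0), (1.5, 0.5)]
first = select_next_item(0.0, items, administered=set())
second = select_next_item(0.0, items, administered={first})
print(first, second)
```

A complete scheme would re-estimate proficiency after each response before selecting again; that step is omitted here to keep the selection rule itself in view.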

This work assumed, for the most part, that the IRT model was
known and correct, and that true values or accurate estimates of
item parameters were available. Current IRT research emphasizes
integrating IRT into the general framework of statistical
inference, and gaining an understanding of just when and how IRT
models are appropriate.

Statistical  Inference  in  Item  Response  Theory 

Early applications of IRT were designed more to demonstrate
its potential than to solve actual measurement problems. Data
were gathered with tests written according to CTT dicta; the same
long tests were administered to many examinees, and each item had
passed CTT quality checks. Illustrative purposes were served
adequately by rough estimation procedures that treat point
estimates of examinee- and item-parameters as if they were the
parameters themselves, ignoring the uncertainty associated with
the estimates. These approximations break down when IRT is
applied beyond the usual limits of CTT testing, as when examinees
are presented only, say, fifteen items in adaptive testing or five
in educational assessments (Mislevy, 1988). In response, IRT
researchers have turned to two active lines of research in
statistics: missing data methods and Bayesian estimation.

Missing data methods are relevant because a latent variable
such as an IRT examinee proficiency parameter can be viewed as a
datum whose value is missing for everyone. General results on
estimating parameters when some data are missing, such as
Dempster, Laird, and Rubin's (1977) EM algorithm, have led to
methods of item parameter estimation that are at once rigorous and
efficient (e.g., Bock and Aitkin, 1981; Tsutakawa, 1984). Results
on statistical information in missing data problems yield insights
into the uncertainty structures of IRT parameters (Mislevy and
Sheehan, in press; Mislevy and Wu, 1988) and offer ways of
increasing accuracy by exploiting collateral information about
items and examinees (Mislevy, 1987, 1988a).

The Bayesian perspective confronts uncertainty head on,
expressing what is known about parameters as probability
distributions. When these distributions are concentrated, the
expedient of using point estimates as if they were the true
parameters can give acceptable results in subsequent analyses.
But when the distributions are diffuse, one must propagate the
uncertainty into subsequent analyses to obtain correct inferences.
Statistical reasoning along these lines was proposed by Kelley
(1927), and championed by Novick in the 1970's
(e.g., Novick and Jackson, 1974), but only now are the ideas
gaining currency. In this framework, one can determine when the
standard, simpler approximations suffice, but use (admittedly
more complex) correct analyses when they don't. For examples in
IRT estimation problems, see Bock and Aitkin (1981) on item
parameters, Mislevy (1988b) on proficiency distributions, and
Tsutakawa and Soltys (1988) on individuals' proficiencies.
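
The contrast between concentrated and diffuse distributions can be seen in a small sketch. The code below is an illustration under assumptions that are not taken from the works cited: a Rasch-type logistic model with known item difficulties, a standard normal prior on proficiency, and a posterior approximated on a fixed grid. The function names and the data are invented.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under a Rasch-type logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def posterior_summary(responses, difficulties, grid):
    """Posterior mean and standard deviation of proficiency on a grid.

    Assumes a standard normal prior and known item difficulties.
    """
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for x, b in zip(responses, difficulties):
            p = rasch_probability(theta, b)
            w *= p if x == 1 else (1.0 - p)
        weights.append(w)

    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


grid = [-4.0 + 0.1 * i for i in range(81)]

# Same proportion correct (3 of 5), observed on 5 items and on 20 items.
short_test = posterior_summary([1, 0, 1, 0, 1], [0.0] * 5, grid)
long_test = posterior_summary([1, 0, 1, 0, 1] * 4, [0.0] * 20, grid)
print(short_test)
print(long_test)
```

With five items the posterior standard deviation remains a large fraction of the prior's, so treating the point estimate as the true proficiency discards real uncertainty; with twenty items the distribution is narrower and the expedient does less harm.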


The  Question  of  Model  Fit 

But of course the IRT model is never exactly correct. A
single variable that accounts for all nonrandomness in examinees'
responses is not a serious representation of cognition, but a
caricature that can solve applied problems when it captures the
patterns  that  are  salient  to  the  job.  The  pattern  that  CTT  and 
IRT  can  capture  is  examinees'  tendencies  to  give  correct 
responses,  which  can  usefully  inform  decisions  about  linearly 
ordered  alternatives.  IRT  was  a  practical  advance  beyond  CTT 
because  it  provides  information  about  overall  proficiencies  in 
more  flexible  ways.  It  was  a  conceptual  advance  because  it 
provides  a  framework  for  detecting  anomalies  in  the  "overall 
proficiency" paradigm.

This can be illustrated with Rasch's (1960) model for
right/wrong items, supposing for convenience all examinees are
presented the same test. Under CTT, all examinees with a given
total score would be treated alike. Under the Rasch model, all
examinees with the same score would receive the same ability
estimate, and might also be treated alike--depending on an
analysis of model fit. Combining an examinee's proficiency
estimate with an item's difficulty estimate, the Rasch model
states how likely a correct response would be if the
single-proficiency conception of ability were true. The items that
high scorers missed should usually be hard ones, and the items low
scorers got right should be easy ones. Finding that these
patterns hold supports making the same decisions about people with
the same scores, because, to an approximation, they got the same
items right and the same ones wrong. Total scores, and thus Rasch
ability estimates, convey nearly everything these data have to say
about comparing these examinees.

Footnote: Under other IRT models such as the 2- and 3-parameter
logistic models, examinees with the same total score need not
receive exactly the same ability estimate, but usually receive
similar estimates. Correlations between total scores and IRT
estimates in typical educational tests are usually above .96, and
few decisions would be made differently with any IRT model, or, if
everyone has taken the same test, even with CTT.

To  the  extent  that  high  scoring  examinees  miss  items  that  are 
generally  easy  and  low  scoring  examinees  get  hard  ones  right, 
neither  total  scores  nor  IRT  ability  estimates  may  be  capturing 
all  the  systematic  information  in  the  data.  Analyses  of  an 
individual's  unexpected  responses  can  reveal  misconceptions  or 
atypical patterns of learning (Mead, 1976; Smith, 1986; Tatsuoka,
1983). To understand these patterns one must look beyond the
simple universe of the IRT model--to the content of the items, the
structure  of  the  learning  area,  the  pedagogy  of  the  discipline, 
and  the  psychology  of  the  problem  solving  tasks  the  items  demand. 
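
One simple way to flag unexpected responses is to count, for each examinee, the pairs of items in which an easier item was missed while a harder one was answered correctly. The sketch below uses this count (often called a Guttman error count) purely as an illustration; it is not necessarily the statistic used in the works cited, and the function name, item difficulties, and response patterns are invented.

```python
def guttman_errors(responses, difficulties):
    """Count item pairs where an easier item is wrong and a harder one is right.

    responses:    list of 0/1 scores for one examinee
    difficulties: item difficulties, in the same order as responses
    """
    # Reorder the responses from the easiest item to the hardest.
    order = sorted(range(len(responses)), key=lambda j: difficulties[j])
    ordered = [responses[j] for j in order]

    errors = 0
    for i in range(len(ordered)):
        for k in range(i + 1, len(ordered)):
            if ordered[i] == 0 and ordered[k] == 1:
                errors += 1
    return errors


difficulties = [-1.5, -0.5, 0.5, 1.5]

# Three examinees with the same total score of 2.
expected_pattern = [1, 1, 0, 0]   # easy items right, hard items wrong
mixed_pattern = [1, 0, 1, 0]
reversed_pattern = [0, 0, 1, 1]   # easy items wrong, hard items right

for pattern in (expected_pattern, mixed_pattern, reversed_pattern):
    print(pattern, guttman_errors(pattern, difficulties))
```

All three examinees would receive the same total score and the same Rasch ability estimate, yet only the first conforms to the pattern the single-proficiency conception predicts. A count of this kind signals that something is unusual; it does not say what, which is why the content of the items must be consulted.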

Now, patterns in responses other than overall level of
proficiency can have educational and psychological meaning,
yet hold no salience for a particular decision. If overall
proficiency  in  a  domain  of  items  suffices  for  a  particular 
decision,  as  can  be  the  case  with  linearly  ordered  educational 
options,  cross-current  patterns  constitute  data  variation  that 
need  not  be  explicated.  This  is  the  essence  of  statistical 
modeling:  expressing  the  patterns  that  are  dominant  and  meaningful 
in  terms  of  model  parameters,  and  allowing  for  departures  from 
these  patterns  in  terms  of  distributions  of  residuals.  But  if  the 
decision  does  depend  on  the  cross-current  patterns,  in  addition  to 
or  instead  of  overall  proficiency,  neither  CTT  nor  standard  IRT 
may  be  the  right  tool  for  the  job. 

The  issue  of  model  fit,  then,  is  more  pragmatic  than 
statistical,  since  lack  of  fit  must  be  judged  in  practice  by  the 
nature  and  the  magnitude  of  the  errors  it  causes.  An  IRT  model 
might be satisfactory for selecting honors math students, for
example, if people with similar scores have similar chances of
success--even though examinees with similar scores have different
profiles of skills and knowledge. The profile differences could
be modeled as "noise" without harm for the selection decision--but
probably not for advising individual examinees which topics to
study to maximally increase their scores.

Measuring  learning  is  one  application  where  IRT  models  can 
fail,  because  they  accommodate  only  a  highly  constrained  type  of 
change:  an  examinee's  chances  of  success  on  all  items  must 
increase  or  decrease  by  exactly  the  same  amount  (in  an  appropriate 
metric).  A  single  IRT  model  applied  to  pretest  and  posttest  data 
cannot  reveal  how  different  students  learn  different  topics  to 
different degrees--patterns that could be at the crux of an
instructional  decision. 
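
The constraint on change can be checked numerically. The sketch below assumes a Rasch-type logistic model and takes the "appropriate metric" to be the log-odds (logit) of a correct response; both are assumptions made for illustration, and the function names and numbers are invented. Under these assumptions, a change in proficiency shifts the log-odds of success by the same amount on every item, whatever its difficulty.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under a Rasch-type logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def logit_gains(theta_pre, theta_post, difficulties):
    """Change in log-odds of success on each item between pretest and posttest."""
    return [
        logit(rasch_probability(theta_post, b)) - logit(rasch_probability(theta_pre, b))
        for b in difficulties
    ]


# Invented example: one examinee whose proficiency rises from -0.5 to 1.0.
difficulties = [-2.0, 0.0, 1.5]
gains = logit_gains(-0.5, 1.0, difficulties)
print([round(g, 6) for g in gains])
```

Every gain equals the change in proficiency, so a student who mastered one topic while making no progress on another cannot be represented by a single such model fitted to both occasions.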

Testing  and  Learning 

Good "macro-level" decisions to place students into
appropriate educational programs are important in increasing the
quality of education, but they are not sufficient. Tracking
students as they progress opens the door to finer grained
"micro-level" decisions to enhance learning along the way. Good
decision-making  at  this  level  requires  an  inferential  framework 
built  around  an  understanding  of  how  students  learn. 

A picture of a learner that is consistent with standard test
theory is that of a collector of facts and skills, adding each to
his repertoire more or less independently of others. Recent
developments in psychology sketch a markedly different picture,
reflecting the astounding capabilities and the surprising
limitations of the mind--lightning fast recognition of stored
patterns and creative applications of heuristic strategies, on the
one hand; yet with short term memory capacities of only about
seven elements and an inability to perform more than one
attention-demanding task at a time. Performance is to be
understood through the availability of well-practiced procedures
that no longer demand high levels of attention ("automaticity");
strategies by which actions are selected, monitored, and, when
necessary, switched ("metacognitive skills"); and the mental
structures that relate facts and skills ("schema"). Learning is
to be understood through the automatization of procedures; the
acquisition and enhancement of metacognitive skills; and the
construction, revision, and replacement of schema.

Comparing  the  performances  of  novices  and  experts  offers 
insights  into  the  nature  of  performance  and  learning.  A  first, 
unsurprising,  difference  is  that  experts  command  more  facts  and 
concepts  than  novices,  and  have  richer  interconnections  among 
them.  Interconnections  overcome  limitations  of  short  term  memory; 
while  the  novice  may  work  with  seven  distinct  elements,  the  expert 
works  with  seven  constellations  that  embody  relationships  among 
many elements ("chunking"). Moreover, experts often organize
their  knowledge  in  schemata  possessing  not  simply  more 
connections,  but  qualitatively  different  ones.  The  advanced 
concepts that college physics students acquire, for example, can
be  organized  around  informal  associations  or  naive  misconceptions 
(Caramazza,  McCloskey,  and  Green,  1981).  These  novices  tackle 
physics  problems  in  less  effective  ways  than  expert  physicists, 
whose  more  appropriate  schemata  lead  them  to  the  crux  of  the 
matter  (Chi,  Feltovich,  and  Glaser,  1981).  Experts  also  differ 
from  novices  by  having  automatized,  through  study  and  practice, 
procedures  that  were  once  slow  and  attention  consuming,  allowing 
them  to  focus  on  novel  aspects  of  a  problem,  look  from  different 
perspectives,  and  more  efficiently  monitor  and  guide  their  efforts 
as  they  work  (Lesgold  and  Perfetti,  1978). 

The  challenge  to  education  is  to  discover  what  experiences 
help  a  learner  with  a  given  configuration  of  propositions,  skills, 
and  connections  to  reconfigure  that  knowledge  into  a  more  powerful 
arrangement.  Vosniadou  and  Brewer  (1987)  point  to  Socratic 
dialogue  and  analogy  as  mechanisms  that  facilitate  such  learning. 
To  apply  them  effectively,  one  must  take  into  account  not  simply 
target  configurations,  such  as  the  expert's  model,  but  the 
individual  learners'  current  configurations.  The  challenge  to 
test  theory  is  to  provide  models  and  methods  to  assess  knowledge, 
and  to  guide  instruction,  as  seen  in  this  new  light. 


To  what  extent  can  standard  test  theory  meet  this  challenge? 
Recall  that  standard  test  theory  characterizes  performance  only  as 
to  overall  level  of  proficiency,  and  learning  only  as  to  change  in 
overall proficiency. Cronbach and Furby (1970) note the
inadequacy  of  such  measures  of  change  when  applied  with 
conventional  broad  range  educational  tests: 

Even when [test scores] X and Y are determined by the
same operation [e.g., scores under the same CTT or IRT
model], they often do not represent the same
psychological processes (Lord, 1958). At different
stages of practice or development different processes
contribute to performance of a task. Nor is this merely
a matter of increased complexity; some processes drop
out, some remain but contribute nothing to individual
differences within an age group, some are replaced by
qualitatively different processes. (p. 76)

Standard test scores can be connected more closely with
cognition if they summarize performance only over tasks that are
very homogeneous in their requirements (Glaser, 1963), and this
specificity marked the criterion-referenced testing movement of
the 1960's and 1970's. Merely defining testing areas very
narrowly, however, is not sufficient to make test scores
instructionally relevant (Glaser, 1981). A list of scores in
narrowly defined areas ignores the interconnections among scores
induced by the knowledge, skills, and strategies they tap in
pairs, in triples, or in hierarchies of the specific behaviors--
yet it is at just this level that instructional relevance must be
sought.

New  Tests,  New  Test  Theory 

A  learner's  state  of  competence  at  a  given  point  in  time  is  a 
complex  constellation  of  facts  and  concepts,  and  the  networks  that 
interconnect  them;  of  automatized  procedures  and  conscious 
heuristics,  and  their  relationships  to  knowledge  patterns  that 
signal  their  relevance;  of  perspectives  and  strategies,  and  the 
management capabilities by which the learner focuses his efforts.

There is no hope of providing a complete description of such
a state. Neither is there a need to. The new pedagogy need
merely(!) identify communalities among states of competence that
can be linked to instructional actions that facilitate changes to
preferable states. Distinctions need not be made among all
possible states, but only among classes of states with different
instructional implications. The new tests to inform instructional
decisions need merely(!) present tasks that learners in the
different states are likely to carry out in observably different
ways: not only correctly as opposed to incorrectly, but at what
speed, with what intermediate products, or with which incorrect
response; not simply as independent pieces of information from
distinct items, but in patterns of similarity, dissimilarity, or
independence across tasks that probe knowledge structures and
problem-solving strategies. The new test theory need merely(!)
provide models whose parameters are capable of expressing the
salient patterns, and inferential procedures upon which to base
instructional decisions in the presence of uncertainty.

Foundations of the new pedagogy are to be found in the union
of analyses of key concepts in a substantive area, research into
the cognitive psychology of the area, and detailed observations of
learners as they progress. Greeno (1976) argues that the tools
and the perspectives of cognitive and educational psychology have
developed to a point at which they can be used to generate
instructional objectives in this manner. He provides detailed
illustrations in three substantive domains at increasing levels of
complexity and sophistication: fourth-grade fractions, high school
geometry, and college level auditory psychophysics.

Foundations of the new theory of test construction are
similarly to be found in educational and cognitive psychology
(Embretson, 1985a; Messick, 1984). Standard vocabulary items
suffice to ascertain the breadth of a learner's familiarity with
concepts in a substantive area, but tasks based on analogies probe
the interconnections among concepts. Speed of response is more
informative than correctness about the automaticity of procedures,
hence a better guide to assigning additional practice on a
currently conscious process. Designing appropriate measures
demands familiarity with the substantive field, not just with the
knowledge structures of the expert but with the incomplete or
inaccurate structures novices often use. To see how the requisite
cognitive and substantive analyses might be carried out, and how
tasks that differentiate among learners at different states of
competence might then be constructed, the reader is referred to
Curtis and Glaser (1983) on reading achievement and Marshall
(1985) on "story problems" in arithmetic.

Foundations of the new test theory are to be found in the
general principles that led to the development of item response
theory. The examinee will be characterized by parameters that
express tendencies to act in accordance with the various
continuous levels or discrete states in simplified models of
cognition. Tasks will be characterized by parameters that indicate
the extent to which they tap different aspects of knowledge
structures, procedures, or strategies. As in IRT, individual
differences among examinees that are not salient to the decision
will be modeled as random--not as a psychologically tenable
position, but as a practically useful expedient.

Beyond  "Low-to-High  Proficiency" 

The breadth of problems to which standard test theoretic
models have been usefully applied, despite their limited
low-to-high conception of proficiency, suggests a certain
robustness of modeling. It is not necessary that models account
for all possible ways students might approach a test, but it is
necessary that they can capture instructionally relevant patterns.
A test must be designed to highlight the pertinent patterns, and
analyzed with a model capable of expressing them.

The idea of building test items around cognitive principles
can be traced back at least as far as Guttman's facet design
tests (Guttman, 1970). Guttman worked out methods for
analyzing data from such tests within the framework of classical
test theory. Scheiblechner (1972) and Fischer (1973), with their
"linear logistic test model," expressed item difficulty parameters
in the Rasch IRT model as functions of psychologically salient
features of test items, but still characterized examinees in terms
of overall proficiency. More recently, test theory models built
around patterns other than overall proficiency have begun to
appear in the psychometric literature.
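
The idea behind the linear logistic test model can be illustrated
with a short sketch. In the Python fragment below, each item
difficulty is a weighted sum of the effects of the features the item
involves, and the resulting difficulties enter an ordinary Rasch
response function. The feature coding (Q_MATRIX), the feature
effects (ETA), and the function names are hypothetical, chosen only
for illustration; they are not taken from Scheiblechner or Fischer.

```python
import math


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def lltm_difficulties(q_matrix, eta):
    """Item difficulty as a weighted sum of item-feature effects."""
    return [sum(q * e for q, e in zip(row, eta)) for row in q_matrix]


# Hypothetical design: three items coded on two cognitive features.
# Row i lists how many times item i calls on each feature.
Q_MATRIX = [[1, 0], [0, 1], [1, 1]]
# Hypothetical feature effects (contribution of each feature to difficulty).
ETA = [0.5, 1.0]

difficulties = lltm_difficulties(Q_MATRIX, ETA)
# Success probabilities for an examinee with overall proficiency 0.
probabilities = [rasch_probability(0.0, b) for b in difficulties]
```

Note that the examinee is still described by the single proficiency
theta; only the item side of the model carries cognitive structure.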

"Tectonic  plate"  models.  Increasing  competence  in  a 
substantive  area  need  not  be  reflected  as  uniformly  increasing 
chances of success on all tasks. Patterns of smooth increase may
be observed for certain people on certain sets of tasks, in
certain phases of development; standard test theory will give good
summaries of change in these neighborhoods. Discontinuous
patterns of change begin to appear as the scope of tasks becomes
broader,  as  the  range  of  development  becomes  greater,  and  as  the 
range  of  experiences  of  examinees  becomes  more  diverse.  "Tectonic 
plate"  models  generalize  IRT  by  allowing  for  a  limited  number  of 
predetermined,  theory-driven,  discontinuities  in  item  response 
patterns.  In  tectonic  plate  geological  models,  points  within  a 
given  land  mass,  or  plate,  maintain  their  relative  positions,  but 
the  plates  move  with  respect  to  one  another.  In  tectonic  plate 
psychometric  models,  items  tapping  the  same  set  of  skills  maintain 
their  difficulties  relative  to  one  another,  but  the  difficulties 
of  the  groups  of  items  change  with  respect  to  other  groups  as 
learners  acquire  new  skills  or  concepts. 

Wilson's (1985, 1989) "Saltus" model extends the Rasch IRT
model to development with discontinuous jumps. An example is
Siegler's (1981) rule-learning analysis of balance-beam tasks,
where students can increase their competence either by using the
rules they know more effectively (continuous change) or by
learning new rules (discontinuous change). Sometimes students who
learn a new rule begin to miss a type of problem they used to get
right, because their previous, less complete, set of rules gave
the right answers for the wrong reasons. This pattern flouts
standard test theory. The Saltus model assumes that each examinee
is in one of a number of unobservable stages of development.
Items are classified so that all items in a class have the same
relationship to developmental stages. One set of item parameters
expresses relative difficulties among items within item classes,
which, like Rasch item difficulty parameters, are the same for
people in all stages. A second set of parameters quantifies
patterns that the Rasch model cannot express: differences in
relative difficulties between item classes for people in different
stages, such as the difficulty reversals mentioned above. Saltus
is effectively a mixture of standard Rasch models.
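
A minimal sketch of a Saltus-type response function may make the
two sets of parameters concrete. It assumes that the logit of a
correct response is the examinee's proficiency, minus the item's
difficulty, plus a shift that depends on the examinee's stage and the
item's class; the sign convention, the parameter values, and all
names below are illustrative assumptions rather than Wilson's own
specification.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def saltus_probability(theta, stage, item, difficulties, item_class, tau):
    """P(correct) for an examinee at proficiency theta in a given stage."""
    shift = tau[stage][item_class[item]]  # stage-by-class parameter
    return logistic(theta - difficulties[item] + shift)


# Hypothetical item difficulties, identical for people in every stage.
DIFFICULTIES = {"a1": -0.5, "a2": 0.0, "b1": 0.5, "b2": 1.0}
# Hypothetical, known classification of items into classes A and B.
ITEM_CLASS = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
# Hypothetical shifts: in stage 2, class A items become harder and
# class B items easier, relative to stage 1.
TAU = {1: {"A": 0.0, "B": 0.0}, 2: {"A": -1.5, "B": 2.0}}

# A learner before and after acquiring a new rule: proficiency rises,
# yet the chance of success on item a1 falls (a difficulty reversal).
p_before = saltus_probability(0.0, 1, "a1", DIFFICULTIES, ITEM_CLASS, TAU)
p_after = saltus_probability(0.5, 2, "a1", DIFFICULTIES, ITEM_CLASS, TAU)
```

Within a class the difference in logits between two items is the
same in both stages, so each stage taken alone follows a Rasch model;
only the relative position of the classes moves.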

Mislevy and Verhelst (in press) have discussed mixture models
more generally, listing assumptions, laying out general models,
and suggesting estimation procedures. They emphasize situations
in which different subjects follow different strategies, pointing
out that instructional decisions can depend on how students solve
problems, not just how many they solve. The salient features of
items are those that can differentiate among users of different
strategies, mental models, or conceptions about key relationships.
An examinee is characterized by the probabilities that she
employed the various alternative strategies, and a conditional
estimate of proficiency under each. Measurement with such a model
can indicate change that is either quantitative (e.g., the
examinee employed Strategy A on both occasions, but more
effectively at the second) or qualitative (e.g., she used Strategy
A before instruction but Strategy B afterwards).
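
The characterization of an examinee under such a mixture can be
sketched as follows. The fragment assumes, purely for illustration,
that a Rasch model holds under each strategy with strategy-specific
item difficulties, that proficiency has a standard normal prior
approximated on a grid, and that the two strategies are equally
likely a priori. None of these choices, nor any of the names or
numbers, comes from Mislevy and Verhelst.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def pattern_likelihood(responses, theta, difficulties):
    """Rasch likelihood of a right/wrong pattern at proficiency theta."""
    likelihood = 1.0
    for x, b in zip(responses, difficulties):
        p = logistic(theta - b)
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood


def normal_grid(n_points=41, low=-4.0, high=4.0):
    """Discrete approximation to a standard normal proficiency prior."""
    step = (high - low) / (n_points - 1)
    thetas = [low + i * step for i in range(n_points)]
    density = [math.exp(-0.5 * t * t) for t in thetas]
    total = sum(density)
    return [(t, d / total) for t, d in zip(thetas, density)]


def strategy_posterior(responses, difficulties_by_strategy, prior, grid):
    """Return P(strategy | responses) and E(theta | responses, strategy)."""
    joint, conditional_mean = {}, {}
    for strategy, difficulties in difficulties_by_strategy.items():
        weighted = [(t, w * pattern_likelihood(responses, t, difficulties))
                    for t, w in grid]
        marginal = sum(v for _, v in weighted)
        joint[strategy] = prior[strategy] * marginal
        conditional_mean[strategy] = sum(t * v for t, v in weighted) / marginal
    norm = sum(joint.values())
    return {s: j / norm for s, j in joint.items()}, conditional_mean


# Hypothetical difficulties: items 1-2 are easy under Strategy A and
# hard under Strategy B; items 3-4 show the reverse.
DIFFICULTIES = {"A": [-1.0, -1.0, 1.0, 1.0], "B": [1.0, 1.0, -1.0, -1.0]}
PRIOR = {"A": 0.5, "B": 0.5}

posterior, conditional_mean = strategy_posterior(
    [1, 1, 0, 0], DIFFICULTIES, PRIOR, normal_grid())
```

With these illustrative values, the pattern of two successes followed
by two failures yields a posterior probability of about 0.98 for
Strategy A, even though the number correct alone says nothing about
which strategy was used.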

Latent  class  models.  Although  models  with  continuous  latent 
variables  have  dominated  educational  measurement,  Lazarsfeld 
(1950)  introduced  models  with  categorical  latent  variables  nearly 
half  a  century  ago.  Most  educational  applications  of  latent  class 
models have been in "mastery" testing; one attempts to infer an
examinee's unobservable state--master or nonmaster--on the basis
of observable responses (Macready and Dayton, 1977, 1980). In the
more recent "binary skills" models (Haertel, 1984), examinees are
classified in terms of which of a set of skills they possess.
This "true" classification is unobservable. Items are classified
according to which of the skills they require for solution. This
classification is known. Ideally, an examinee responds correctly
to exactly those items that require only skills he or she
possesses. The stochastic parameters of the model reflect
departures from this ideal.
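
The logic of a binary skills model can be sketched in a few lines.
The fragment below assumes two stochastic parameters, a "slip"
probability (missing an item despite having the required skills) and
a "guess" probability (succeeding without them), together with a
uniform prior over skill profiles. These parameter names and values,
like the item-by-skill requirements, are assumptions made for
illustration and are not Haertel's parameterization.

```python
from itertools import product


def ideal_response(profile, required):
    """True if the profile contains every skill the item requires."""
    return all(profile[skill] for skill in required)


def profile_posterior(responses, requirements, n_skills, slip=0.1, guess=0.2):
    """Posterior over all 0/1 skill profiles, under a uniform prior."""
    unnormalized = {}
    for profile in product((0, 1), repeat=n_skills):
        likelihood = 1.0
        for x, required in zip(responses, requirements):
            # Departures from the ideal pattern are governed by slip and guess.
            p_correct = 1.0 - slip if ideal_response(profile, required) else guess
            likelihood *= p_correct if x == 1 else 1.0 - p_correct
        unnormalized[profile] = likelihood
    total = sum(unnormalized.values())
    return {profile: v / total for profile, v in unnormalized.items()}


# Hypothetical, known classification of six items by required skills
# (skill indices 0 and 1).
REQUIREMENTS = [(0,), (0,), (1,), (1,), (0, 1), (0, 1)]
# Observed right/wrong responses of one examinee.
RESPONSES = [1, 1, 0, 0, 0, 0]

posterior = profile_posterior(RESPONSES, REQUIREMENTS, n_skills=2)
most_likely_profile = max(posterior, key=posterior.get)
```

Here the examinee answers only the items requiring the first skill
correctly, and the profile (1, 0) receives a posterior probability of
about 0.95. Enumerating all profiles is feasible only for a handful
of skills, which is one face of the computational constraints
discussed next.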

Except  in  the  special  case  of  mastery  testing,  computational 
constraints  have  limited  applications  of  latent  class  models  to  no 
more  than  about  ten  items  until  recently.  Information  about  skill 
profiles  in  groups  can  be  gleaned  from  such  data,  but  individuals' 
skills  could  not  be  inferred  accurately.  Improved  computational 
procedures  have  opened  the  door  to  applications  with  30  or  60 
items  (e.g.,  Paulson,  1986;  Yamamoto,  1987),  and  work  with 
structurally  similar  models  in  expert  systems  holds  promise  of 
handling much larger problems (Lauritzen and Spiegelhalter, 1988).
Progress in this direction is vital to educational applications,
since these inferences demand more data than low-to-high
proficiency inferences. Moreover, adaptive testing, which made
IRT measurement more efficient, will be able to make latent class
measurement practicable (Dayton and Macready, 1989; Falmagne and
Doignon, 1988).

Componential models. The models described above were
introduced with right/wrong test items, which, if constructed
carefully, yield response patterns that differentiate examinees
who tackle them in different ways. Richer information can be
accumulated if it is possible to track intermediate products of
solution. Consider, for example, a situation in which the binary
skills model applies. Inferences about skill profiles can be
stronger if one can see which subtasks were attempted and their
outcomes: overall correctness can result from one sequence of
correct operations or another, or a fortuitous mixture of correct
and incorrect operations; overall incorrectness can be caused by a
poor plan of attack, or a flawed execution of a good plan. Early
implementations of these ideas have been worked out by Embretson
(1983, 1985b) and Samejima (1983).

All of the models discussed above--tectonic plate, latent
class, and componential models--exhibit the same cardinal feature:
they support inferences about proficiencies other than just
low-to-high ability because, and only because, the user specifies
theoretically salient patterns of response other than just
less-to-more correct answers. Current implementations require
expertise in statistics as well as in the substantive area. Test
theory researchers must embed these approaches in generally
applicable computer routines, or shells, so that a broader range
of users can put them into practice in the substantive areas.[1]

Beyond Right/Wrong, Multiple-Choice Items

Currently  IRT  is  used  almost  exclusively  to  draw  inferences 
about  a  low-to-high  proficiency  variable  from  responses  to 
multiple-choice test items. The preceding section discussed how,
even with multiple-choice data, one can found inferences upon
radically  different  conceptions  of  proficiency.  Inferences  can  be 
made  yet  stronger,  and  decision-making  more  efficient,  if 
different  kinds  of  data  can  be  collected. 

We have mentioned the possibility of exploiting the identity
of incorrect responses to multiple-choice items, for cases in which
particular misconceptions are probed in more than one item and we
wish to infer how an examinee is approaching tasks. IRT models
that distinguish among incorrect alternatives have been discussed
by Bock (1972), Masters (1982), Samejima (1979), and Thissen and
Steinberg (1984). These papers show how to connect observations
more complex than right/wrong to the standard psychological model
of low-to-high proficiency. The same machinery for the
observational aspect of modeling can be used when the
psychological aspect is an alternative cognitive model. Embretson
(1983, 1985b) and Masters (Masters and Mislevy, 1989) have taken
some initial steps in this direction.

[1] Similar diffusion processes have already occurred in two
areas related to test theory. The first is IRT itself. In the
1960's, only a handful of mathematically talented researchers
could use IRT; now IRT is widely used by practitioners by virtue
of production programs such as LOGIST (Wingersky, Barton, and
Lord, 1982), BILOG (Mislevy and Bock, 1983), and BICAL (Wright,
Mead, and Bell, 1980). The second area is that of linear
structural relationships among variables with measurement error.
Proposing such a model and solving the equations was once
practically grounds for a Nobel prize in economics; now anyone
with access to the LISREL computer program (Jöreskog and Sörbom,
1986) can routinely carry out analyses undreamed of a few decades
ago.

Because data collected on computers can provide response times
routinely, response latency can also be exploited. Response
latencies are particularly pertinent to inferences about
automaticity: a correct answer arrived at through a laborious
conscious process can have different instructional implications
than the same response obtained through automatized processes.
Response latencies can also be used in conjunction with
correctness to design items that differentiate among examinees who
use different strategies. Many quantitative items in the SAT, for
example, can be solved either by a "brute force" calculation or by
a simple calculation if a key relationship is recognized; "correct
and fast" suggests the insightful solution. Scheiblechner [year
illegible] and Thissen (1983) show how to use response times to
measure low-to-high proficiency. Their methods of linking observed
responses to expected responses could be applied with an
alternative cognitive model for expected responses.

Beyond Tester-Controlled Observational Settings

Traditional educational tests present small, closed-form [...]
traditional tests and the wholly unstructured observation of
performance in natural settings.

Most of the work in this area has been carried out in the arena
of medical education in the form of "patient management problems,"
or PMPs (Assmann, Hixon, and Kacmarek, 1979). A simulated patient
(through a written or oral dialogue, or as a live actor or a
computer model) presents the examinee with initial symptoms; the
examinee requests tests, considers their results, prescribes
treatments, and monitors their effects, generally attempting to
identify and treat the initially unknown disease. Despite their
appeal as evocators of critical problem-solving skills, PMPs do
not seem to provide reliable data from the perspective of standard
test theoretic techniques (McGuire, 1985). For the same amount of
testing time, reliability coefficients of PMP scores prove
disappointingly low compared with those of multiple-choice tests.

A possible explanation of this result is that standard test
theory analyses of PMP data are not looking for the right
patterns. They look at simple additive combinations of single
outcomes, rather than relationships that might suggest
associations among facts in examinees' schema, or indicate the use
of effective or ineffective problem-solving strategies. A
distinct stream of medical research, however, does address these
relationships: "expert systems" that help health care workers with
diagnostic problems (e.g., Pope, 1981; Shortliffe et al., [year
illegible]).

An expert system representation of a health care area is
built around associations among unobservable disease states,
observable symptoms and test results, and outcomes of treatments.
Some expert systems express these associations through "fuzzy
logic" (Zadeh, 198[?]) or "belief functions" (Shafer, 1976), but
ones that use conditional probabilities (Spiegelhalter, [year
illegible]) are extensions of the latent class models discussed
above. In an educational setting, associations would be delineated
among substantive concepts, strategies, observable outcomes, and
prescribed instruction (Clancey, 1988).


There are two levels at which expert systems could be
implemented in educational settings. The first appears more
amenable to end-of-course or macro-level decision-making, while
the second seems better suited to an ongoing instructional system.

In  the  first,  simpler,  approach,  an  expert  system  is  built 
only  for  a  "correct"  model.  An  examinee's  responses  are  evaluated 
in  terms  of  their  efficacy  at  each  decision  point  as  compared  with 
the  best  possible  action  given  present  information.  If  scores 
were also available from a standard multiple-choice test of
knowledge,  one  could  distinguish  performance  problems  caused  by 
strategic  errors  from  those  caused  by  knowledge  deficiencies. 

In the second, more ambitious, approach, not only would a
correct expert system be built, but examinees' possibly "inexpert
systems" would be inferred. Perhaps the best known example of
this  type  is  Anderson's  (Anderson  and  Reiser,  1985)  computer 
programming  tutor.  Although  more  individualized  instructional 
prescriptions  can  be  made  in  this  way,  inferring  even  selected 
aspects  of  examinees'  schema  and  strategies  requires  far  more  data 
than  does  comparing  performance  to  a  fixed  expert  model.  A 
successful  system  of  this  type  would  probably  require  a  more 
constrained  problem  space  and  more  extensive  interactions  of  the 
learner  with  the  simulation. 

Conclusion

Einstein's theory of relativity revolutionized physics, but
it extended rather than supplanted Newton's laws of motion.
Classical mechanics still works just fine, thank you, for building
bridges, planning billiards shots, and figuring out how to stand
up from an overstuffed easy chair. And as long as educators are
called upon to make the macro-level, linearly-ordered decisions
that engendered standard test theory, standard test theory will
continue to be useful, and will continue to be used. Recent
developments in technology, however, provide opportunities for
decision making at the micro-level more frequently and for larger
numbers of students than ever before; recent developments in
education and psychology give us conceptions of competence and
learning that can be used to guide these decisions.

Researchers in education and psychology have begun to lay the
theoretical groundwork to link testing with the cognitive
processes of learning. Meanwhile, researchers in measurement and
statistics have made breakthroughs in inferential procedures for
the models of standard test theory. To inform modern educational
decisions requires drawing together the insights from these two
strands of research--the twin foundations of a new test theory.
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Educational  Testing  Service 
Princeton,  NJ  08541 

Ms.  Rebecca  Hetter 

Navy  Personnel  R&D  Center 

Code  63 

San  Diego,  CA  92152-6800 

Dr .  Paul  W .  Holland 
Educational  Testing  Service, 
Roseda I e  Road 
Princeton,  NJ  08541 

Prof .  Lutz  F .  Hornke 
Institut  fur  Psychologie 
RWTH  Aachen 
Jaegers trasse  17/19 
D-5100  Aachen 
WEST  GERMANY 

Dr .  Paul  Horst 
677  G  Street,  #184 
Chula  Vista,  CA  92010 
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Mr.  Dick  Hoshaw 

rr’-u5 

Arlington  Annex 
Room  2834 

Washington,  DC  20350 

Dr.  Lloyd  Humphreys 
University  of  Illinois 
Department  of  Psychology 
603  East  Daniel  Street 
Champaign,  IL  61820 

Dr .  Steven  Hunka 
3-104  Educ.  N. 

University  of  Alberta 
Edmonton ,  Alberta 
CANADA  T6G  2G5 

Dr.  Huynh  Huynh 
College  of  Educat i on 
Un i v .  of  South  Carolina 
Co i umb i a ,  SC  29208 

Dr.  Robert  Jannarone 
Elec,  and  Computer  Eng.  Dept. 
Un;versity  of  South  Carolina 
Co  I umb i a ,  SC  29208 

Dr.  Douglas  H.  Jones 
Thatcher  Jones  Associates 
P.0.  Box  6640 
10  T raf  a  I  gar  Court 
Lawrencevi  lie,  NJ  08648 

Dr.  Brian  Junker 
University  of  Illinois 
Department  of  Statistics 
101  I  I  I  i n i  Hall 
725  South  Wright  St. 
Champaign,  II  61820 

Dr.  Milton  S.  Katz 
European  Science  Coordination 
Office 

U.S.  Army  Research  Institute 
Box  65 

FPO  New  ,'ork  09510-1500 

Prof.  John  A.  Keats 
Department  of  Psychology 
University  of  Newcastle 
N.S.W.  2308 
AUSTRALIA 


Dr.  G.  Gage  Kingsbury 

Portland  Public  Schools 

Research  and  Evaluation  Department 

501  North  Dixon  Street 

P.  0.  Box  3107 

Portland,  OR  97209-3107 

Dr.  William  Koch 
Box  7246,  Meas.  and  Eval.  Ctr. 
University  of  Texas-Austin 
Austin,  TX  78703 

Dr.  Leonard  Kroeker 
Navy  Personnel  R&D  Center 
Code  62 

San  Diego,  CA  92152-6800 

Dr.  Jerry  Lehnus 

Defense  Manpower  Data  Center 

Su i te  400 

1600  W i  I  son  B I vd 

Rosslyn,  VA  22209 

Dr.  Thomas  Leonard 
University  of  Wisconsin 
Department  of  Statistics 
1210  West  Dayton  Street 
Madison,  WI  53705 

Dr .  Mi  chae I  Levine 
Educational  Psychology 
210  Education  Bldg. 

Un  i  vers i ty  of  Illinois 
Champaign,  IL  61801 

Dr.  Char  I es  Lewi s 
Educational  Testing  Service 
Princeton,  NJ  08541-0001 

Dr.  Robert  L.  Linn 
Campus  Box  249 
University  of  Colorado 
Boulder,  CO  80309-0249 

Dr.  Robert  Lockman 
Center  for  Naval  Analysis 
4401  Ford  Avenue 
P.0.  Box  16268 
Alexandria,  VA  22302-0268 

Dr.  Frederi-  M.  Lord 
Educational  Testing  Service 
Princeton,  NJ  08541 
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Dr.  George  B.  Macready 
Department  of  Measurement 
Statistics  &  Evaluation 
College  of  Education 
University  of  Maryland 
College  Park,  MD  20742 

Dr.  Gary  Marco 
Stop  31-E 

Educational  Testing  Service 
Princeton,  NJ  08451 

Dr.  James  R.  McBride 
The  Psychological  Corporation 
1250  Sixth  Avenue 
San  Diego,  CA  92101 

Dr.  Clarence  C.  McCormick 
HQ,  USMEPCOM/MEPCT 
2500  Green  Bay  Road 
North  Chicago,  IL  60064 

Dr .  Robert  McK i n I ey 

Law  School  Admission  Services 

Box  40 

Newtown,  PA  18940 

Dr .  James  McM i chae I 
Technical  Director 
Navy  Personnel  R&D  Center 
San  Diego,  CA  92152-6800 

Dr.  Robert  Mislevy 
Educational  Testing  Service 
Princeton,  NJ  0854 1 

Dr .  William  Montague 

NPRDC  Code  13 

San  Diego,  CA  92152-6300 

Ms.  Kathleen  Moreno 
Navy  Personnel  R&D  Center 
Code  62 

San  Diego,  CA  92152-6800 

Headquarters  Marine  Corps 
Code  MPI-20 
Washington,  DC  20380 

Dr.  W.  Alan  Nicewander 
University  of  Oklahoma 
Department  of  Psychology 
Norman,  OK  73071 


Deputy  Technical  Director 

NPRDC  Code  01A 

San  Diego,  CA  92152-6800 

Director,  Training  Laboratory, 
NPRDC  (Code  Of) 

San  Diego,  CA  92152-6800 

Director,  Manpower  and  Personnel 
Laboratory , 

NPRDC  (Code  06) 

San  Diego,  CA  92152-6800 

Director,  Human  Factors 

&  Organizational  Systems  Lab, 
NPRDC  (Code  07) 

San  Diego,  CA  92152-6800 

L  i  b rary ,  NPRDC 
Code  P201L 

San  Diego,  CA  92152-6800 

Commanding  Officer, 

Naval  Research  Laboratory 
Code  2627 

Washington,  DC  20390 

Dr.  Harold  F.  O'Neil,  Jr. 

School  of  Education  -  WPH  801 
Department  of  Educational 
Psychology  &  Technology 
University  of  Southern  California 
Los  Angeles,  CA  90089-0031 

Dr.  James  B.  Olsen 
WICAT  Systems 
1875  South  State  Street 
Orem,  UT  84058 

Office  of  Naval  Research, 

Code  1142CS 
800  N.  Quincy  Street 
Arlington,  VA  22217-5000 
(6  Copies) 

Office  of  Naval  Research, 

Code  125 

800  N.  Quincy  Street 
Arlington,  VA  22217-5000 
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Assistant  for  MPT  Research, 
Development  and  Studies 
OP  01B7 

Washington,  DC  20370 

Dr.  Judith  Orasanu 
Basic  Research  Office 
Army  Research  Institute 
5001  Eisenhower  Avenue 
Alexandria,  V A  22333 

Dr .  Jesse  Or  I  ansky 

Institute  for  Defense  Analyses 

1801  N.  Beauregard  St. 

Alexandria,  VA  22311 

Dr  .  Peter  J .  Pash  I  ey 
Educational  Testing  Service 
Rcsedale  Road 
Pr i nceton ,  NJ  08541 

Wayne  M.  Patience 
American  Council  on  Education 
GED  Testing  Service,  Suite  20 
One  Dupont  Circle,  NW 
Washington,  DC  2003E 

Dr .  James  Paulson 
Department  of  Psychology 
Portland  State  University 
P.G.  Box  751 
Portland,  OR  97207 

Dept,  of  Administrative  Sciences 
Code  54 

Naval  Postgraduate  School 
Monterey,  CA  93943-5026 

Department  of  Operations  Research, 
Naval  Postgraduate  School 
Monterey,  CA  93940 

Dr.  Mark  D.  Reckase 
AC  T 

P.  0.  Box  168 
Iowa  City,  IA  52243 

Dr.  Malcolm  Ree 
AFHRL/MOA 

Brooks  APB,  TX  78235 


UP  9 

ting  Se r v i c e/M i s I e v y 


Mr.  Steve  Reiss 
N660  Elliott  Hall 
University  of  M  mesota 
75  E.  River  Road 
Minneapolis,  MN  55455-034^ 

Dr.  Carl  Ross 
CNET-PDCD 
Building  30 

Great  Lakes  NTC,  IL  60088 
Dr.  J.  Ryan 

Department  of  Education 
University  of  South  Carol Ina 
Columbia,  SC  29208 

Dr.  Fumiko  Samejima 
Department  of  Psychology 
University  of  Tennessee 
310B  Austin  Peay  Bldg. 
Knoxville,  TN  37916-0900 

Mr.  Drew  Sands 

NPRDC  Code  62 

San  Diego>  CA  92152-6800 

Lowe  II  Sc  hoe  r 

Psychological  &  Quantitative 
Foundations 
College  of  Education 
University  of  I owa 
Iowa  City,  IA  52242 

Dr.  Mary  Schratz 
905  Orch i d  Way 
Carlsbad,  CA  92009 

Dr .  Dan  Segal  1 

Navy  Personnel  R&D  Center 

San  Diego,  CA  92152 

Dr.  W.  Steve  Sell  man 
OASD  (MRA8.L  > 

2B269  The  Pentagon 
Washington,  DC  20301 

Dr .  Kazuo  Shi gemasu 
7-9-24  Kugenuma-Ka i qan 
Fuji  sawa  25 1 
JAPAN 
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Dr.  William  Sims 
Center  for  Naval  Analysis 
440 1  Ford  Avenue 
P.0.  Box  16268 
Alexandria,  V A  22302-0268 

Dr.  H .  Wallace  Sinaiko 
Manpower  Research 

and  Advisory  Services 
Smithsonian  Institution 
801  North  Pitt  Street,  Suite  120 
Alexandria,  VA  22314-1713 

Dr.  Richard  E.  Snow 
School  of  Education 
Stanford  University 
Stanford,  CA  9430b 

Dr.  Richard  C.  Sorensen 
Navy  Personnel  R&D  Center 
San  Diego,  CA  92152-6800 

Dr.  Judy  Spray 
ACT 

P . 0 .  Box  168 

Iota  City,  I A  5224  3 

Dr.  Martha  Stocking 
Educational  Testing  3  e  r  v  e 
Princeton,  NJ  08541 

Dr.  Peter  Stoloff 
Center  for  Naval  Analysis 
4401  Ford  Avenue 
P.0.  Box  16268 
Alexandria,  VA  22302-0268 

Dr.  William  Stc^t 
University  of  Illinois 
Department  of  Statistics 
101  III  ini  Hall 
725  South  Wright  St. 

Champa i gn ,  IL  61820 

Dr.  Hariharan  Swaminathan 
Laboratory  of  Psychometric  and 
Evaluation  Research 
School  of  Education 
University  if  Massachusetts 
Amherst,  MA  01003 


Testing  Se r v i c e/M i s  I  e v y 


Mr.  Brad  Sympson
Navy  Personnel  R&D  Center
Code-131
San  Diego,  CA  92152-6800

Dr.  John  Tangney
AFOSR/NL,  Bldg.  410
Bolling  AFB,  DC  20332-6448

Dr.  Kikumi  Tatsuoka
CERL
252  Engineering  Research
Laboratory
103  S.  Mathews  Avenue
Urbana,  IL  61801

Dr.  Maurice  Tatsuoka 
220  Education  Bldg 
1310  S.  Sixth  St. 
Champaign,  IL  61820 

Dr.  David  Thissen
Department  of  Psychology 
University  of  Kansas 
Lawrence,  KS  66044 

Mr.  Gary  Thomasson 
University  of  Illinois 
Educational  Psychology 
Champaign,  IL  61820 

Dr.  Robert  Tsutakawa 
University  of  Missouri 
Department  of  Statistics 
222  Math.  Sciences  Bldg.
Columbia,  MO  65211

Dr.  Ledyard  Tucker 
University  of  Illinois 
Department  of  Psychology 
603  E.  Daniel  Street 
Champaign,  IL  61820 

Dr.  David  Vale 
Assessment  Systems  Corp. 
2233  University  Avenue 
Suite  440
St.  Paul,  MN  55114

Dr.  Frank  L.  Vicino 
Navy  Personnel  R&D  Center 
San  Diego,  CA  92152-6800 

Dr.  Howard  Wainer
Educational  Testing  Service 
Princeton,  NJ  08541 

Dr.  Ming-Mei  Wang
Lindquist  Center
for  Measurement
University  of  Iowa
Iowa  City,  IA  52242 

Dr.  Thomas  A.  Warm
FAA  Academy  AAC934D 
P.O.  Box  25082 
Oklahoma  City,  OK  73125 

Dr.  Brian  Waters 
HumRRO
12908  Argyle  Circle
Alexandria,  VA  22314 

Dr.  David  J.  Weiss
N660  Elliott  Hall
University  of  Minnesota
75  E.  River  Road
Minneapolis,  MN  55455-0344 

Dr.  Ronald  A.  Weitzman
Box  146
Carmel,  CA  93921

Major  John  Welsh
AFHRL/MOAN
Brooks  AFB,  TX  78223

Dr.  Douglas  Wetzel
Code  51
Navy  Personnel  R&D  Center
San  Diego,  CA  92152-6800

Dr.  Rand  R.  Wilcox
University  of  Southern
California
Department  of  Psychology
Los  Angeles,  CA  90089-1061

German  Military  Representative
ATTN:  Wolfgang  Wildgrube
Streitkraefteamt
D-5300  Bonn  2
4000  Brandywine  Street,  NW
Washington,  DC  20016


Dr.  Bruce  Williams
Department  of  Educational
Psychology
University  of  Illinois
Urbana,  IL  61801

Dr.  Hilda  Wing
NRC  MH-176
2101  Constitution  Ave.
Washington,  DC  20418 

Mr.  John  H.  Wolfe
Navy  Personnel  R&D  Center
San  Diego,  CA  92152-6800

Dr.  George  Wong 
Biostatistics  Laboratory 
Memorial  Sloan-Kettering
Cancer  Center 
1275  York  Avenue 
New  York,  NY  10021 

Dr.  Wallace  Wulfeck,  III 
Navy  Personnel  R&D  Center 
Code  51
San  Diego,  CA  92152-6800

Dr.  Kentaro  Yamamoto 
03-T
Educational  Testing  Service
Rosedale  Road 
Princeton,  NJ  08541 

Dr.  Wendy  Yen 
CTB/McGraw  Hill 
Del  Monte  Research  Park 
Monterey,  CA  93940 

Dr.  Joseph  L.  Young 
National  Science  Foundation 
Room  320
1800  G  Street,  N.W.
Washington,  DC  20550 

Mr.  Anthony  R.  Zara 
National  Council  of  State 
Boards  of  Nursing,  Inc. 
625  North  Michigan  Avenue 
Suite  1544
Chicago,  IL  60611


Dr.  Ratna  Nandakumar
Dept.  of  Educational  Studies
Willard  Hall,  Room  213
University  of  Delaware
Newark,  DE  19716


